Abstract

One of the most utilized data mining tasks is the search for association rules. Association rules
represent significant relationships between items in transactions. We extend the concept of association
rule to represent a much broader class of associations, which we refer to as entity-relationship rules.
Semantically, entity-relationship rules express associations between properties of related objects.
Syntactically, these rules are based on a broad subclass of safe domain relational calculus queries. We
propose a new definition of support and confidence for entity-relationship rules and for the frequency of
entity-relationship queries. We prove that the definition of frequency satisfies standard probability
axioms and the Apriori property.



1 Introduction

One of the goals of data mining is to discover interesting relationships from data. Association rules express
relationships that hold with sufficient frequency but not always. For example, it may be the case that not
all managers earn over $60,000 a year, but that 90% of managers do. The logical form of an association rule
is that of an implication p ⇒ q where p and q hold together sufficiently often (the "support" of the rule)
and q holds sufficiently often given that p holds (the "confidence" of the rule). The traditional concept of
association rules severely limits the complexity of the expressions p and q and thereby limits the class of
relationships a data miner can capture. Essentially, p and q may be only simple conjunctions, like an itemset.
Thus we cannot have rules based on Boolean combinations, such as negations or nested combinations. An
example of a relationship involving a negation would be a negative factor, such as "students who have not
taken an introductory database course do poorly in data mining courses". An example of a nested Boolean
combination would be "students who are math majors or computer science majors, and who have done well
in a discrete mathematics course or in an algorithms course, do well in complexity theory". Another class
of relationships that association rules cannot express involves quantification and relating objects to each
other. An example would be the rule "residents who have a neighbour with a high income tend to have a
high income themselves".

The goal of this paper is to extend the concept of an association rule to a large class of expressions 
that we refer to as entity-relationship queries (ER queries). Intuitively, entity-relationship queries express 
dependencies among entities and their properties. Entity-relationship queries are a large subclass of the safe 
queries. Safe queries correspond to an expressive subset of first-order logic that allows for nested Boolean 
expressions and quantification. We provide a definition of the frequency of an entity-relationship query. This 
extends the notion of an association rule to implications of the form p ⇒ q where p ∧ q is an ER query; we
refer to rules of this form as entity-relationship rules. From our definition of the frequency of an ER query
we immediately obtain a definition of the support of an ER rule, namely the frequency fr(p ∧ q).

TV-Program(Prog-Name:string)
TV-Station(Station-Name:string, Area:integer)
WeekdayTV(TV-Program:string, TV-Station:string, Viewers:integer, Sponsor:string)
WeekendTV(TV-Program:string, TV-Station:string, Viewers:integer, Sponsor:string)

Table 1: A relational schema for a TV survey model. Key fields are underlined. The schema lists TV 
programs and stations, and records for each combination of weekday program and station, how many viewers 
view the program on that station, and who sponsors the program. The same information is recorded for 
weekend programs. 

Our definition of frequency for ER queries generalizes previous work on defining association rules in a 
multi-relational setting. [1] discusses extending itemset rules with negations and motivates the usefulness of
this extension. The query extension approach of the Warmr system [3] presents a special class of entity- 
relationship rules that allows conjunctions of nonnegated statements and existential quantification. Our 
concept of ER rules features in addition negations, universal quantification, nested quantifiers, and nested 
Boolean combinations. Thus one contribution of this paper is an extended rule format. A characteristic that 
distinguishes our approach from previous work is that previous approaches assume a given target table that 
defines a base set of tuples for evaluating the support of a query. In contrast, we start with a query and 
define a natural base set of tuples for evaluating the support of the query. We can think of this approach as 
dynamically generating entity sets for a given query rather than evaluating queries with respect to a fixed 
entity set. Thus the second main contribution of this paper is a new definition of support for rules in our 
extended format. 

The paper is organized as follows. First we review basic relational database concepts such as the relational 
schema and the domain relational calculus. Then we introduce the concept of an entity query and define the 
frequency of a query in this class of queries. This definition provides the basis for the notion of an entity- 
relationship rule and for defining the support of an entity-relationship rule. We compare entity-relationship 
queries to frequent itemsets and to the rule language of the Warmr system. The final section establishes 
several important formal properties of query frequencies as we define them and shows that they satisfy the
Apriori property, that is, the frequency of a conjunction is no greater than the frequency of its conjuncts. 

2 Entities in the Domain Relational Calculus 

This section presents standard background material from database theory. The first subsection reviews 
relational schemas, and introduces the new concept of an entity field. Semantically, entity fields are those 
that store values (constants) that refer to entities. The second subsection defines the standard notion of a 
safe query in the domain relational calculus, and the third introduces a subclass of safe queries that we term 
entity-relationship queries. 

2.1 Entities in Relational Schemas 

We begin with a standard relational schema containing a set of tables, each with key fields, descriptive 
attributes, and possibly foreign key pointers. We use the notation T to refer to a generic table that may 
represent either an entity set or a relationship set, and T^i when we use an index. A field named name in
table T is denoted by T.name. Table 1 shows a relational schema for a TV survey database; this example is
adapted from fS', Sec. 2]. Tables 2-4 display relation instances for the TV survey schema.

We assume that the tables in the relational schema can be divided into entity tables and relationship 
tables. This is the case whenever a relational schema is derived from an entity-relationship model (ER 
model) [HI Ch.2.2]. Intuitively, an entity table corresponds to a type of entity, and a relationship table 
represents a relation between entity types. In our TV survey example, there are two types of entities: TV 
programs represented in the TV-Program table, and TV stations represented in the TV-Station table. We
now introduce two assumptions concerning the relational schema that facilitate the definition of
entity-relationship queries and their frequencies.

TV-Program     TV-Station   Viewers   Sponsor
------------   ----------   -------   --------
Gilmore        Global       10        Avon
Gilmore        CBS          12        La Senza
Hockey Night   CBC          20        RBC

Table 2: Television Survey: Weekday TV.

TV-Program     TV-Station   Viewers   Sponsor
------------   ----------   -------   --------
Gilmore        Global       8         Avon
Hockey Night   CBC          14        Schwab
Simpsons       CBS          10        RBC
Daily Show     CBC          6         La Senza

Table 3: Television Survey: Weekend TV.

Unary Key Assumption We assume that every entity table has a single key field. 

The advantage of the unary key assumption is that given this assumption, a single key field in the 
relational schema refers to a single entity. The assumption holds in our TV survey schema because the two 
entity tables have key fields TV-Program.Prog-Name and TV-Station.Station-Name respectively. Although
it is not always natural to define entities with a single key field, there is no loss of generality because we can 
always form a single composite key field from a list of key fields. For example, if in a Professor table there 
are two key fields FirstName, LastName, we can form a composite key field (FirstName, LastName). Our 
second assumption is the following. 

Global Name Assumption We assume that for every entity e, there is a unique constant c such that in 
every table, the constant c denotes entity e. 

The global name assumption is important because it allows us to recognize when the same entity occurs 
in different tables. In the AI literature, a similar assumption is often referred to as the "unique name
assumption" [3, Ch. 14]. The assumption does not amount to a loss of generality because if the same
constant c is used in different tables to refer to different entities, we can simply index c to distinguish these
occurrences. For example, if we have two different transaction tables Transaction1 and Transaction2, and
there is a transaction 1 in both, we could change the entry in the first table to refer to 1-1 and in the second
table to refer to 1-2. A natural alternative to indexing constants would be to adopt a convention to the
effect that a key field T.key in table T refers to different entities than key field T'.key in table T' if and only
if the names of the key fields in the two tables are different. For example, if we have a table for Employees
and another for Managers, labelling the key field in each table as "ssn" indicates that a given social security
number refers to the same person no matter where it appears. In contrast, labelling the key field in the
Transaction1 table "T1-number" and the key field in the Transaction2 table "T2-number" indicates
that the transaction numbers in different tables refer to different transactions.



Station-Name   Area
------------   ----
Global         1
CBS            2
CBC            3

Table 4: Television Survey: Stations and Areas.

Symbol Type            Notation            Comment
--------------------   -----------------   ----------------------------------------
Constants              c_1, c_2, ...       At most countably many constants
Predicate Symbols      P_1, ..., P_n       Exactly one predicate for each table T^i
Logical Symbols        ∃, ∀, ∧, ∨, ¬
Comparison Operators   =, <, >, ≤, ≥, ≠

Table 5: The Basic Vocabulary of our DRC language for a given database schema D with tables T^1, ..., T^n.

In many applications, the global name assumption is enforced through foreign key constraints. To 
illustrate, in the TV example, we may suppose that the field WeekdayTV.TV-Station is a foreign key pointer
to the field TV-Station.Station-Name, and that the field WeekendTV.TV-Station is a foreign key pointer to
the same field. So the string constant "CBS" refers to the CBS network represented in the TV-Station table, 
whether "CBS" appears in an instance of the WeekdayTV relation or in an instance of the WeekendTV 
relation. 

Given the unary key and global name assumptions, the following is a valid definition of how tables, key 
fields and constants are associated with entities. 

Definition 1 Let D be a database instance.

1. An entity table is a table T with a single key field.

2. A field is an entity field if (1) the field is the key of an entity table, or (2) the field is a foreign
key pointer to the key of an entity table.

3. A constant c is an entity constant if c appears in an entity field.

Examples. Let D be the TV survey database instance from Tables 2-4. The entity fields are
TV-Program.Prog-Name, TV-Station.Station-Name, WeekdayTV.TV-Program, WeekdayTV.TV-Station,
WeekendTV.TV-Program, and WeekendTV.TV-Station. Entity constants include "CBS" and "Simpsons".
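
The classification in Definition 1 can be computed mechanically from schema metadata. The following Python sketch is illustrative only: the dictionary encoding SCHEMA, the helper names, and the composite keys assumed for the two relationship tables are not taken from the paper. The rows are those of Tables 2-4; no instance of the TV-Program table is listed there, so none is included.

```python
# Hypothetical encoding of the TV survey schema (Table 1). The "key" and "fk"
# entries are assumptions for illustration; in particular, the composite keys
# of WeekdayTV and WeekendTV are assumed.
RELATIONSHIP = {
    "fields": ["TV-Program", "TV-Station", "Viewers", "Sponsor"],
    "key": ["TV-Program", "TV-Station"],
    "fk": {
        "TV-Program": ("TV-Program", "Prog-Name"),
        "TV-Station": ("TV-Station", "Station-Name"),
    },
}
SCHEMA = {
    "TV-Program": {"fields": ["Prog-Name"], "key": ["Prog-Name"], "fk": {}},
    "TV-Station": {"fields": ["Station-Name", "Area"], "key": ["Station-Name"], "fk": {}},
    "WeekdayTV": RELATIONSHIP,
    "WeekendTV": RELATIONSHIP,
}

# Rows of Tables 2-4.
INSTANCE = {
    "TV-Station": [("Global", 1), ("CBS", 2), ("CBC", 3)],
    "WeekdayTV": [
        ("Gilmore", "Global", 10, "Avon"),
        ("Gilmore", "CBS", 12, "La Senza"),
        ("Hockey Night", "CBC", 20, "RBC"),
    ],
    "WeekendTV": [
        ("Gilmore", "Global", 8, "Avon"),
        ("Hockey Night", "CBC", 14, "Schwab"),
        ("Simpsons", "CBS", 10, "RBC"),
        ("Daily Show", "CBC", 6, "La Senza"),
    ],
}


def entity_tables(schema):
    """Definition 1.1: tables with a single key field."""
    return {name for name, meta in schema.items() if len(meta["key"]) == 1}


def entity_fields(schema):
    """Definition 1.2: keys of entity tables and foreign keys pointing to them."""
    tables = entity_tables(schema)
    fields = {(name, schema[name]["key"][0]) for name in tables}
    for name, meta in schema.items():
        for field, (target, target_field) in meta["fk"].items():
            if target in tables and schema[target]["key"] == [target_field]:
                fields.add((name, field))
    return fields


def entity_constants(schema, instance):
    """Definition 1.3: constants that appear in some entity field."""
    constants = set()
    for table, field in entity_fields(schema):
        column = schema[table]["fields"].index(field)
        constants.update(row[column] for row in instance.get(table, []))
    return constants


if __name__ == "__main__":
    print(sorted(entity_fields(SCHEMA)))
    print(sorted(entity_constants(SCHEMA, INSTANCE)))
```

Under these assumptions the sketch reports the six entity fields listed above, and descriptive values such as viewer counts and sponsors are not classified as entity constants.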

Next we review the domain relational calculus, which is a logical query language based on a given 
relational schema. 

2.2 Safe Queries in the Domain Relational Calculus 

We first define the formal language of the domain relational calculus, including the well-formed formulas of
the calculus. Then we define an important subclass of formulas known as safe queries. Our presentation
follows the standard approach, see for example [8, Ch. 3].

2.2.1 The Formal Language of the Domain Relational Calculus

In the domain relational calculus (DRC), for every table T^i in the database schema there is exactly one
predicate P_i in the logical language. The number of fields in the table is the arity of the predicate P_i. If
T^i is an entity table, then P_i is an entity predicate. By the unary key assumption, an entity table T^i has
a single key field; we adopt the convention that the key field is the first argument in the entity predicate P_i.
The complete logical vocabulary of the DRC is listed in Table 5.

Example. In the TV survey model, we have the predicates shown in Table 6.

Thus we may write WeekdayTV("Gilmore Girls", "CBS", 12, "La Senza") to assert that "Gilmore Girls" is
shown on "CBS" on weekdays, with 12,000 viewers, and sponsored by La Senza. The notion of a well-formed
formula is the usual one for this vocabulary.

Definition 2 Well-Formed Formulas of the Domain Relational Calculus

1. A constant c or variable X is a term.

2. If P is a predicate symbol of arity k and t_1, ..., t_k are k terms, then P(t_1, ..., t_k) is an atomic formula.

3. If t, t' are two terms and θ is a comparison operator, then a comparison t θ t' is an atomic formula.

4. If F is a formula and X is a variable, then ¬F, ∃X.F, ∀X.F are formulas.

5. If F_1 and F_2 are formulas, then so are F_1 ∧ F_2 and F_1 ∨ F_2.

6. All formulas are formed by the repeated application of the previous rules.

Examples. Table 7 gives examples of valid expressions and their types pertaining to the TV survey.

Table 6: Predicates of our Logical Query Language for the TV survey model.

Predicate                 Arity
-----------------------   -----
TV-Program(PN)            1
TV-Station(SN, A)         2
WeekdayTV(PN, SN, V, S)   4
WeekendTV(PN, SN, V, S)   4

Table 7: Examples of Valid Expressions for the database schema for the TV survey.

Expression: V > 10
    Type: atomic formula with V free
Expression: ∃S.∃SN.∃V. WeekdayTV(P, SN, V, S)
    Type: quantified formula with P free
Expression: [∃S.∃SN.∃V. WeekdayTV(P, SN, V, S) ∧ V > 10] ∧ [∃S.∃SN.∃V. WeekendTV(P, SN, V, S) ∧ V > 10]
    Type: conjunction of quantified formulas

We next define the result or output of a DRC query. The first step is to define what ground formulas
are satisfied in a database instance D; a formula is ground if it contains no variables. The second step is to
define which closed queries F with no free variables are satisfied in a database instance D; as usual in logic,
we write D ⊨ F. Let F[X_1/t_1, ..., X_k/t_k] be the formula that results from replacing all free occurrences of
each X_i in F with the term t_i.

1. If t, t' are two constants, then D ⊨ t θ t' iff t θ t' holds.

2. D ⊨ P_i(c_1, ..., c_k) iff (c_1, ..., c_k) is a tuple in table T^i.

3. D ⊨ F_1 ∨ F_2 iff D ⊨ F_1 or D ⊨ F_2; similarly D ⊨ F_1 ∧ F_2 iff D ⊨ F_1 and D ⊨ F_2; and D ⊨ ¬F_1 iff
D ⊭ F_1.

4. D ⊨ ∃X.F iff there is a constant c in the DRC language such that D ⊨ F[X/c]; similarly D ⊨ ∀X.F
iff for all constants c we have D ⊨ F[X/c].

Let F(X_1, ..., X_m) be a query with free variables X_1, ..., X_m. Then on database instance D the query F
returns the set of all tuples that make F true when substituted in F. Formally, we write

\[ \mathrm{tuples}_{D}(F) = \{ (c_1, \ldots, c_m) : D \models F[X_1/c_1, \ldots, X_m/c_m] \}. \]

This definition assumes that the constants in the language include all constants that appear in the
database tables, which involves no loss of generality.

Examples. Let D be the database instance from Tables 2-4. Table 8 shows the results of our example
queries for this database instance.
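
For the existentially quantified conjunctive queries used in the examples, the result set can be read off directly from the table rows, because quantifying away a column amounts to projecting it out. The Python sketch below relies on that observation; the function name is hypothetical, the rows are those of Tables 2 and 3, and the comparison V > 10 is applied strictly to the listed viewer counts.

```python
# Rows of Table 2 (weekday) and Table 3 (weekend):
# (TV-Program, TV-Station, Viewers, Sponsor).
WEEKDAY_TV = [
    ("Gilmore", "Global", 10, "Avon"),
    ("Gilmore", "CBS", 12, "La Senza"),
    ("Hockey Night", "CBC", 20, "RBC"),
]
WEEKEND_TV = [
    ("Gilmore", "Global", 8, "Avon"),
    ("Hockey Night", "CBC", 14, "Schwab"),
    ("Simpsons", "CBS", 10, "RBC"),
    ("Daily Show", "CBC", 6, "La Senza"),
]


def programs_over(rows, threshold):
    """tuples_D(F) for F(P) = ∃S.∃SN.∃V. T(P, SN, V, S) ∧ V > threshold."""
    return {program for (program, _station, viewers, _sponsor) in rows
            if viewers > threshold}


f1 = programs_over(WEEKDAY_TV, 10)   # weekday query with free variable P
f2 = programs_over(WEEKEND_TV, 10)   # weekend query with free variable P
both = f1 & f2                       # F_1 ∧ F_2 shares the free variable P

if __name__ == "__main__":
    print(sorted(f1))
    print(sorted(both))
```

Because both conjuncts have the same single free variable P, the conjunction corresponds to a set intersection of the two result sets.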

Query Formula F                                     Result tuples_D(F)
-------------------------------------------------   ----------------------------
F_1 = ∃S.∃SN.∃V. WeekdayTV(P, SN, V, S) ∧ V > 10    {"Gilmore", "Hockey Night"}
F_2 = ∃S.∃SN.∃V. WeekendTV(P, SN, V, S) ∧ V > 10    {"Hockey Night", "Simpsons"}
F_1 ∧ F_2                                           {"Hockey Night"}

Table 8: Results of Query Formulas on the database instance D from Tables 2-4.

Query Formula F                                     Safe?
-------------------------------------------------   -----
F_1 = ∃S.∃SN.∃V. WeekdayTV(P, SN, V, S) ∧ V > 10    yes
F_2 = ∃S.∃SN.∃V. WeekendTV(P, SN, V, S) ∧ V > 10    yes
F_1 ∧ F_2                                           yes
F_1 ∨ F_2                                           yes
                                                    no
¬F_1 ∨ F_2                                          no
F_1 ∧ ¬F_2                                          yes

Table 9: Examples of safe and unsafe queries for the TV survey database schema of Table 1.

2.2.2 Safe Queries

It is customary to restrict how free variables may occur in a query ("query variables" for
short) to ensure that the set of tuples satisfying a query formula is bounded and "domain-independent"
[8, Ch. 3.8]. To this end we adopt the notion of a safe query. The intuition behind this concept is that the
results of safe queries should be restricted to selection conditions applied to (combinations of) tables in the
database. For example, the query ¬TV-Program(X) with free variable X is not safe because the range of
constants satisfying this query is not bound by any table in the database. The key idea in the definition of
safe query is to conjoin a query formula F with a restriction, giving the form P ∧ F where P is a basic
predicate in the language and hence refers to a table in the database. As is well-known, the expressive power
of safe queries is exactly equivalent to that of relational algebra. Safe queries are formally defined as
follows [8, Ch. 3.8].

1. Replace the ∀X quantifier by ¬∃X¬.

2. Whenever ∨ is used to connect F_1 ∨ F_2, the two formulas have the same set of free variables.

3. Consider any maximal subformula consisting of the conjunction of one or more formulas F_1 ∧ ... ∧ F_n.
Then all variables X appearing free in any of the F_i must be limited as follows. The variable X must
be free in some non-negated F_i satisfying one of the following conditions.

(a) F_i is not a comparison.

(b) F_i is X = c where c is a constant.

(c) F_i is X = Y, and Y is limited.

4. A ¬ operator may apply only to a formula in a conjunction of the type discussed in the previous rule.

Examples. Table 9 gives examples of safe and unsafe queries for the TV database schema from Table 1.
This completes our review of basic concepts from relational database theory. We now come to the
restriction of safe queries to entity-relationship queries.

2.3 Definition of Entity-Relationship Queries 

The basic idea behind our definition of an ER query is that free variables should be limited in such a way 
as to guarantee that they must refer to entities. Intuitively, an ER query is one whose free variables refer to 
entities. The precise definition is as follows. 



Definition 3 Let D be a database instance.

1. A variable X is an entity variable candidate for a DRC formula F if

(a) X is not quantified over in any part of F,

(b) if an expression X θ t appears in F, then θ is = or ≠, and if t is a constant c, then c is an entity
constant in D, and

(c) if an expression P(_, X, _) appears in F, then the argument position of X in P(_, X, _) is an entity
field.

2. A variable X is an entity variable for F if

(a) X is an entity variable candidate for F, and

(b) if an expression X = Y or X ≠ Y appears in F, then Y is an entity variable candidate.

3. An entity-relationship (ER) query F for database instance D is a safe DRC query such that all
the free variables in F are entity variables for F given D.

Examples. Let D be the TV survey database instance from Tables 2-4. In the formula

∃S.∃SN.∃V. WeekdayTV(P, SN, V, S) ∧ V > 10

the variable P is an entity variable, and so the formula is an ER query. The formula

∃S.∃SN. WeekdayTV("Gilmore Girls", SN, V, S)

is safe but not an ER query because the free variable V is not an entity variable.

3 The Frequency of Entity-Relationship Queries

Our basic idea is that the limiting conditions in safe queries specify the domain from which values for a free 
variable X are to be drawn. Once the domain for the free variables is defined for a given formula F, we can 
take the frequency of the formula F to be the number of assignments to the free variables that satisfy the 
formula divided by the size of the domain for the formula. Safe queries are a natural class of queries for this 
approach because these queries specify the range from which result tuples may be drawn by restricting these 
results to subsets of tables in the database (cf. Section 2.2.2).

The main issue in our definition concerns the correct domain for conjunctions or intersections. For a 
simple example, consider a database schema with two entity tables Professor and Customer. The query 
Professor(X) ∧ Customer(X) returns entities that are both professors and customers. What should be the
base domain for this query? If there are many more customers than professors, we may get quite different 
frequency counts if we take the base domain to be Professor than if we take it to be Customer. So neither of 
these seems the right choice. Intuitively the base domain should be a symmetric function of the two classes 
mentioned in the query. The two natural symmetric set-theoretic operations are intersection and union. If we 
take the intersection as the base domain, the frequency of conjunctions without further selection conditions 
is always 100%, which does not seem right. In particular for our ultimate goal of defining the support of 
association rules, this is unsatisfactory. Our proposal is therefore to use the union of the two entity sets 
involved in the conjunction. Another way to look at the union is that it represents a kind of closed world 
assumption: If Professors and Customers are the only entity types mentioned in the selection conditions of 
the query, then the members of these entity types are exactly the potential answers to the query. 

The closed world assumption is also the basis for our frequency definition for queries with negation. For
example, consider a safe query such as Professor(X) ∧ ¬Customer(X). Since Professors and Customers are
the only entity types mentioned in this query, we take the base domain again to be the union of these two
sets. The fact that Professors are mentioned positively and Customers negatively does not make a difference
to the base domain, but it does make a difference to the result of the query and hence to its frequency.

F_1 = ∃S.∃SN.∃V. WeekdayTV(P, SN, V, S) ∧ V > 10
    dom_D(F_1, P) = {"Gilmore", "Hockey Night"}
F_2 = ∃S.∃SN.∃V. WeekendTV(P, SN, V, S) ∧ V > 10
    dom_D(F_2, P) = {"Gilmore", "Hockey Night", "Simpsons", "Daily Show"}
F_3 = F_1 ∧ F_2
    dom_D(F_3, P) = {"Gilmore", "Hockey Night", "Simpsons", "Daily Show"}
F_4 = F_1 ∨ F_2
    dom_D(F_4, P) = {"Gilmore", "Hockey Night", "Simpsons", "Daily Show"}
F_5 = F_1 ∧ ¬F_2
    dom_D(F_5, P) = {"Gilmore", "Hockey Night", "Simpsons", "Daily Show"}

Table 10: Reference Domains for various formulas in the TV survey database instance D from Tables 2-4.
Each entry lists a query formula F followed by its reference domain dom_D(F, P).
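
The effect of the choice of base domain can be checked on a small case. The Python sketch below uses two made-up entity sets (the names are placeholders, not data from the paper) and compares the union-based frequency with the alternatives discussed above.

```python
from fractions import Fraction

# Hypothetical entity sets; the members are placeholders for illustration.
professors = {"ann", "bob"}
customers = {"bob", "cat", "dan", "eve"}

base = professors | customers        # closed-world base domain: the union
conj = professors & customers        # Professor(X) ∧ Customer(X)
neg = professors - customers         # Professor(X) ∧ ¬Customer(X)

freq_conj = Fraction(len(conj), len(base))
freq_neg = Fraction(len(neg), len(base))

# Alternative base domains for the conjunction, for comparison.
by_professor = Fraction(len(conj), len(professors))
by_customer = Fraction(len(conj), len(customers))
by_intersection = Fraction(len(conj), len(conj))

if __name__ == "__main__":
    print(freq_conj, freq_neg, by_professor, by_customer, by_intersection)
```

With these sets the asymmetric choices give 1/2 and 1/4, the intersection gives the uninformative value 1, and the union gives 1/5 for both the conjunction and the query with negation, whose results differ while their base domain is the same.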

On the basis of this proposal, we can now recursively assign a domain to an entity variable in a formula F
given a database instance D. We begin with just one free query variable and then tackle the more complicated
case of queries with more than one free variable.

3.1 Definition of Frequency for Queries With One Free Query Variable

We denote the base domain of an entity variable X in a query F relative to a database instance D as
dom_D(F, X). As we think of variable X as referring to the domain dom_D(F, X), we term dom_D(F, X) the
reference domain of X in the context of query F.

Definition 4 Let D be a database instance with ER formula F.

1. If F is P_i(t_1, ..., t_k), and X occurs in F, then dom_D(F, X) = π_X[tuples_D(F)]. If X is not a free variable
in F, then dom_D(F, X) = ∅. Here we think of tuples_D(F) as a relation whose columns correspond
to the free variables of F. For example, the query p(X, Y, Z) returns a relation with triples, and we
can think of the first column as named X and the second as named Y. The expression π refers to the
projection operator of relational algebra (with elimination of duplicates).

2. Let F be a single atomic comparison of the form Y θ t where t is either a variable or a constant. If F
is X = c, then dom_D(F, X) = {c}. Otherwise dom_D(F, X) = ∅.

3. If F is ¬G for some formula G, then dom_D(F, X) = dom_D(G, X).

4. If F is F_1 ∨ F_2 or F_1 ∧ F_2, then dom_D(F, X) = dom_D(F_1, X) ∪ dom_D(F_2, X).

5. If F is ∃Y.G, where Y ≠ X, then dom_D(F, X) = dom_D(G, X). If F is ∃X.G, then dom_D(F, X) = ∅.

Examples. Let D be the TV survey database instance from Tables 2-4. Table 10 gives examples of
reference domains for various ER queries.

As this definition shows, we think of basic predicates as specifying the range from which entities are
drawn. Conditions of the form P(t_1, ..., X, ..., t_k) or X = c we view as "direct bounds" that determine the
reference domain of X. Variable equations of the form X = Y we view as "selection conditions" that are
applied after an entity has been specified. These do not affect the reference domain of X but only the result
of the query. Another type of selection are restrictions on descriptive attributes, such as V > 10 in the
queries in Table 10.
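
The five cases of Definition 4 translate directly into a recursive function. The following Python sketch is one possible rendering, not the authors' implementation: the tuple-based formula encoding, the Var class, and the helper names are assumptions introduced here, and a predicate is represented by the rows of its table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


# Rows of Tables 2 and 3: (TV-Program, TV-Station, Viewers, Sponsor).
WEEKDAY_TV = [("Gilmore", "Global", 10, "Avon"), ("Gilmore", "CBS", 12, "La Senza"),
              ("Hockey Night", "CBC", 20, "RBC")]
WEEKEND_TV = [("Gilmore", "Global", 8, "Avon"), ("Hockey Night", "CBC", 14, "Schwab"),
              ("Simpsons", "CBS", 10, "RBC"), ("Daily Show", "CBC", 6, "La Senza")]

# Hypothetical formula encoding as nested tuples:
#   ("pred", rows, terms)   ("cmp", left, op, right)   ("not", G)
#   ("and", G, H)           ("or", G, H)               ("exists", Var, G)


def matching_bindings(rows, terms):
    """Yield one variable binding for each table row that matches the terms."""
    for row in rows:
        binding = {}
        for term, value in zip(terms, row):
            if isinstance(term, Var):
                if binding.setdefault(term, value) != value:
                    break                      # repeated variable, different values
            elif term != value:
                break                          # constant does not match the row
        else:
            yield binding


def dom(f, x):
    """Reference domain dom_D(F, X), following the five cases of Definition 4."""
    kind = f[0]
    if kind == "pred":                         # case 1: projection onto X
        _, rows, terms = f
        if x not in terms:
            return set()
        return {b[x] for b in matching_bindings(rows, terms)}
    if kind == "cmp":                          # case 2: only X = c contributes
        _, left, op, right = f
        if op == "=" and left == x and not isinstance(right, Var):
            return {right}
        return set()
    if kind == "not":                          # case 3
        return dom(f[1], x)
    if kind in ("and", "or"):                  # case 4
        return dom(f[1], x) | dom(f[2], x)
    if kind == "exists":                       # case 5
        return set() if f[1] == x else dom(f[2], x)
    raise ValueError(f"unknown formula kind: {kind}")


P, SN, V, S = Var("P"), Var("SN"), Var("V"), Var("S")


def over_ten(rows):
    """Build ∃S.∃SN.∃V. T(P, SN, V, S) ∧ V > 10 for the table with these rows."""
    body = ("and", ("pred", rows, (P, SN, V, S)), ("cmp", V, ">", 10))
    for var in (V, SN, S):
        body = ("exists", var, body)
    return body


F1 = over_ten(WEEKDAY_TV)
F2 = over_ten(WEEKEND_TV)
F3 = ("and", F1, F2)
F5 = ("and", F1, ("not", F2))

if __name__ == "__main__":
    for name, formula in [("F1", F1), ("F2", F2), ("F3", F3), ("F5", F5)]:
        print(name, sorted(dom(formula, P)))
```

Note that the comparison V > 10 contributes the empty set in case 2, so it acts purely as a selection condition and leaves the reference domain of P unchanged.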

Now the frequency of an ER query is defined as follows. 

Definition 5 Let F be an ER query with free variable X such that dom_D(F, X) ≠ ∅. Then

\[ \mathrm{fr}_{D}(F) = \frac{|\mathrm{tuples}_{D}(F)|}{|\mathrm{dom}_{D}(F, X)|}. \]

In Section 5 we establish several formal properties of the frequency of a query according to this definition,
for example that the frequency is a number between 0 and 1.

Examples. Let D be the TV survey database instance from Tables 2-4. Table 11 illustrates the frequencies
of various queries.

Query Formula F                                     Frequency fr_D(F)
-------------------------------------------------   -----------------
F_1 = ∃S.∃SN.∃V. WeekdayTV(P, SN, V, S) ∧ V > 10    1
F_2 = ∃S.∃SN.∃V. WeekendTV(P, SN, V, S) ∧ V > 10    1/2
F_1 ∧ F_2                                           1/4
F_1 ∨ F_2                                           3/4
F_1 ∧ ¬F_2                                          1/4

Table 11: Frequencies for various formulas in the TV survey database instance D from Tables 2-4.
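
Definition 5 can be checked numerically for some of the example queries. The Python sketch below is an illustration with hypothetical helper names; it computes results and reference domains directly from the rows of Tables 2 and 3, applies the comparison V > 10 strictly to the listed viewer counts, and covers F_1, F_1 ∧ F_2 and F_1 ∧ ¬F_2 only.

```python
from fractions import Fraction

# Rows of Tables 2 and 3: (TV-Program, TV-Station, Viewers, Sponsor).
WEEKDAY_TV = [("Gilmore", "Global", 10, "Avon"), ("Gilmore", "CBS", 12, "La Senza"),
              ("Hockey Night", "CBC", 20, "RBC")]
WEEKEND_TV = [("Gilmore", "Global", 8, "Avon"), ("Hockey Night", "CBC", 14, "Schwab"),
              ("Simpsons", "CBS", 10, "RBC"), ("Daily Show", "CBC", 6, "La Senza")]


def programs(rows):
    """Reference domain of P contributed by one table: projection on the program column."""
    return {row[0] for row in rows}


def programs_over(rows, threshold):
    """Programs with some row in the table whose viewer count exceeds the threshold."""
    return {p for (p, _sn, v, _s) in rows if v > threshold}


def frequency(result, domain):
    """fr_D(F) = |tuples_D(F)| / |dom_D(F, X)|, defined only for a nonempty domain."""
    if not domain:
        raise ValueError("frequency is undefined for an empty reference domain")
    return Fraction(len(result), len(domain))


f1 = programs_over(WEEKDAY_TV, 10)
f2 = programs_over(WEEKEND_TV, 10)
dom1 = programs(WEEKDAY_TV)
dom2 = programs(WEEKEND_TV)

fr_f1 = frequency(f1, dom1)                     # F_1
fr_and = frequency(f1 & f2, dom1 | dom2)        # F_1 ∧ F_2
fr_and_not = frequency(f1 - f2, dom1 | dom2)    # F_1 ∧ ¬F_2

if __name__ == "__main__":
    print(fr_f1, fr_and, fr_and_not)
```

The three values agree with the corresponding rows of Table 11. Using exact fractions rather than floats keeps the comparison with the table entries straightforward.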



3.2 Definition of Frequency of Queries With More Than One Free Variable 

We assign a domain to every tuple of entity variables in a formula F given a database instance D, which
we denote as dom_D(F, {X_1, ..., X_m}). Our basic idea is to consider a result tuple (c_1, ..., c_m) as denoting a
composite entity formed by combining m single entities. For example, consider the rule Neighbour(X, Y) ∧
(∃I. Income(X, I) ∧ I > $100,000) ⇒ (∃I. Income(Y, I) ∧ I > $100,000). (The symbol ⇒ does not denote
logical implication but defines an association rule; see Section 4.) This says that if X has an income over
$100,000, then it is likely that a neighbour Y of X also has an income of over $100,000. The support of this rule
is the frequency of the query Neighbour(X, Y) ∧ (∃I. Income(X, I) ∧ I > $100,000) ∧ (∃I. Income(Y, I) ∧ I >
$100,000). This query has two free variables X and Y. The reference domain comprises the entries in
the Neighbour table, that is, the pairs (X, Y) in the table. Other examples of natural composite entities
include relations like reservations or purchases. The idea of treating tuples in a relational table as composite
"individuals" is familiar in the propositionalization literature [5] (for example, chemical molecules may
be treated as single entities although molecules are composed of different elements that are also represented
in the relational schema). Applying this idea requires a further constraint on ER queries: the free variables
{X_1, ..., X_m} must be "bound together" in a limiting condition rather than separately. For example, the
query P(X) ∧ X = Y is a safe ER query but the answer pairs (x, y) are not bound to the key fields of any
tuple; an example of the same character is the query P(X) ∧ Q(Y). To rule out such cases, we impose the
following condition.

Definition 6 A literal is an atomic formula or its negation. An ER query F is valid for variables
X_1, ..., X_m if for every maximal conjunction L = L_1 ∧ ... ∧ L_k consisting only of literals, L contains a
conjunction of the form X_1 = c_1 ∧ ... ∧ X_m = c_m, or L contains a conjunct P(t_1, ..., t_k) where all variables
{X_1, ..., X_m} occur in P(t_1, ..., t_k). An ER query F is valid if F is valid for the set of its free variables.
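
The condition of Definition 6 can be tested for a single maximal conjunction of literals. The Python sketch below is a simplified illustration: the literal encoding and the function name are hypothetical, and it assumes that only non-negated literals can bind the query variables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


# Hypothetical literal encoding:
#   ("pred", name, terms, negated)   for P(t_1, ..., t_k) or its negation
#   ("eq", left, right, negated)     for left = right or its negation


def conjunction_is_valid(literals, query_vars):
    """Check the condition of Definition 6 for one maximal conjunction of literals."""
    query_vars = set(query_vars)
    # First alternative: one atomic conjunct P(t_1, ..., t_k) mentions every query variable.
    for lit in literals:
        if lit[0] == "pred" and not lit[3] and query_vars <= set(lit[2]):
            return True
    # Second alternative: every query variable is equated with a constant, X_i = c_i.
    bound = {
        lit[1]
        for lit in literals
        if lit[0] == "eq" and not lit[3]
        and isinstance(lit[1], Var) and not isinstance(lit[2], Var)
    }
    return query_vars <= bound


X, Y = Var("X"), Var("Y")

# P(X) ∧ X = Y: the pair (X, Y) is not bound together.
q1 = [("pred", "P", (X,), False), ("eq", X, Y, False)]
# P(X) ∧ Q(Y): each variable is bound separately.
q2 = [("pred", "P", (X,), False), ("pred", "Q", (Y,), False)]
# Neighbour(X, Y): both variables occur in one predicate.
q3 = [("pred", "Neighbour", (X, Y), False)]
# X = "a" ∧ Y = "b": both variables are equated with constants.
q4 = [("eq", X, "a", False), ("eq", Y, "b", False)]

if __name__ == "__main__":
    for q in (q1, q2, q3, q4):
        print(conjunction_is_valid(q, [X, Y]))
```

A full validity check would apply this test to every maximal conjunction of literals in the query; the sketch covers only the per-conjunction step.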

Examples follow below in this section. In the case with only one free query variable X, the definition of 
safe query implies that every entity query is valid. Now let us consider the definition of a reference domain 
for valid ER queries with one or more free variables. As in the case with just one query variable, we term 
domx){F,{Xi, ...,Xm}) the reference domain of {Xi, ...,Xm] in the context of query F. Consider the 
basic case of an atomic formula F = P{ti, ...,tm) first. In keeping with the idea behind safe queries, we 
can think of such formulas as specifying a basic range for the result tuples in a query. So suppose that the 
free variables in the atomic formula are Yi,Y2, ..,Yk. If our query variables Xi, Xm are not all contained 
in the set {Yi, ^2, .., Yfc}, we consider that the "composite key" Xi, Xm does not appear in the query. 
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Query Formula F, Reference Domain domx>{F, X) 
Fi = 3S3V.WeekdayTV{P, SN, V,S)AV>10 

dom-p{Fi,X) = {("Gilm.","Glo."), ("Gilm.", "CBS"), ("Hock. N.","CBC")} 

F2 = 3S3V.WeekendTV{P, SN, V,S)AV>10 

do7nv{F2,X) = {("Gilm.","Glo."), ("Hock. N.", "CBC"), ("Simps.", "CBS"), ("Daily Sh.","CBC")} 
F'g, — Fi A F2 

dom-D{F3,X) = {("Gilm.","Glo."), ("Gilm.", "CBS"), ("Hock. N.","CBC"), ("Simps.", "CBS"), ("Daily Sh.","CBC")} 

Table 12: Reference Domains for various formulas in the TV survey database instance V from Tables [SHU 
The free variables query variables are P and SN, corresponding to pairs of programs-stations. 

and domv{F, {Xi, Xm}) — 0- Otherwise we consider the query result tuplesTi(F) as a relation with 
k columns, of which m are named Xi, ..,Xm- For example, the query p{X,Y, Z) returns a relation with 
triples, and we can think of the first column as named X and the second as named Y. Thus we can take 
''^{Xi,...,x„^)tuplesT){F) to be the reference domain of the entity variables Xi, X2, Xm in the query F. This 
leads to the following inductive definition. The main difference with the definition for a single query variable 
is that we need to treat conjunctions like Xi — ci A ■ ■ ■ X„i = c,„ as a single compound statement. 

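Definition 7 Let $\mathcal{D}$ be a database instance with ER formula $F$ and let $X_1, \ldots, X_m$ be a list of variables.

1. If $F$ is $P(t_1, \ldots, t_k)$, and all variables $X_1, \ldots, X_m$ occur in $P(t_1, \ldots, t_k)$, then $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) = \pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(F)$, where $\pi$ is the projection operation of relational algebra. Otherwise $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) = \emptyset$.

2. Let $F$ be a single atomic comparison of the form $Y \theta t$, where $t$ is either a variable or a constant.

(a) Suppose that $m = 1$, $Y = X_1$ and the comparison is $X_1 = c$ (i.e., we just have a single free variable $X_1$ and the atomic formula requires $X_1$ to be equal to a constant $c$). In that case $\mathrm{dom}_{\mathcal{D}}(F, \{X_1\}) = \{c\}$.

(b) Otherwise $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) = \emptyset$.

3. Let $F$ be a maximal conjunction of $k > 1$ formulas, such that $F = C_1 \wedge \cdots \wedge C_k$.

(a) If $F$ is a conjunction of the form $C \wedge X_1 = c_1 \wedge \cdots \wedge X_m = c_m$, then $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) = \mathrm{dom}_{\mathcal{D}}(C, \{X_1, \ldots, X_m\}) \cup \{(c_1, \ldots, c_m)\}$.

(b) Otherwise $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) = \bigcup_{i=1}^{k} \mathrm{dom}_{\mathcal{D}}(C_i, \{X_1, \ldots, X_m\})$.

4. If $F$ is $F_1 \vee F_2$, then $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) = \mathrm{dom}_{\mathcal{D}}(F_1, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_{\mathcal{D}}(F_2, \{X_1, \ldots, X_m\})$.

5. If $F$ is $\neg G$ for some formula $G$, then $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) = \mathrm{dom}_{\mathcal{D}}(G, \{X_1, \ldots, X_m\})$.

6. If $F$ is $\exists Y.G$, where $Y \notin \{X_1, \ldots, X_m\}$, then $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) = \mathrm{dom}_{\mathcal{D}}(G, \{X_1, \ldots, X_m\})$. If $F$ is $\exists X_i.G$ for some $X_i \in \{X_1, \ldots, X_m\}$, then $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) = \emptyset$.

It is easy to check that this definition agrees with Definition 2 for queries with just one free variable.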
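The six clauses of Definition 7 translate fairly directly into a recursive function. The following Python sketch is only an illustration under stated assumptions: the class names (`Var`, `Atom`, `Cmp`, `And`, `Or`, `Not`, `Exists`) and the function `dom` are hypothetical, and the database rows are invented (the viewer counts and the fourth column are arbitrary placeholders, not the paper's TV survey tables).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Atom:            # P(t1, ..., tk); terms are Var objects or constants
    rel: str
    terms: tuple

@dataclass(frozen=True)
class Cmp:             # atomic comparison "left op right"
    left: Var
    op: str
    right: object

@dataclass(frozen=True)
class And:
    parts: tuple

@dataclass(frozen=True)
class Or:
    parts: tuple

@dataclass(frozen=True)
class Not:
    body: object

@dataclass(frozen=True)
class Exists:
    var: Var
    body: object

def is_const_eq(f, xs):
    """True for a formula of the form X_i = c with X_i among xs."""
    return (isinstance(f, Cmp) and f.op == "=" and f.left in xs
            and not isinstance(f.right, Var))

def union_dom(db, parts, xs):
    return set().union(*(dom(db, p, xs) for p in parts))

def dom(db, f, xs):
    """Reference domain of the variables xs (a tuple of Var) in formula f."""
    if isinstance(f, Atom):                                   # clause 1
        if not all(x in f.terms for x in xs):
            return set()
        result = set()
        for row in db[f.rel]:
            env = {}
            for term, value in zip(f.terms, row):
                if isinstance(term, Var):
                    if env.setdefault(term, value) != value:
                        break                                 # repeated variable mismatch
                elif term != value:
                    break                                     # constant mismatch
            else:
                result.add(tuple(env[x] for x in xs))         # projection onto xs
        return result
    if isinstance(f, Cmp):                                    # clause 2
        return {(f.right,)} if len(xs) == 1 and is_const_eq(f, xs) else set()
    if isinstance(f, And):                                    # clause 3
        consts = {p.left: p.right for p in f.parts if is_const_eq(p, xs)}
        if all(x in consts for x in xs):                      # 3(a)
            rest = [p for p in f.parts if not is_const_eq(p, xs)]
            return union_dom(db, rest, xs) | {tuple(consts[x] for x in xs)}
        return union_dom(db, f.parts, xs)                     # 3(b)
    if isinstance(f, Or):                                     # clause 4
        return union_dom(db, f.parts, xs)
    if isinstance(f, Not):                                    # clause 5
        return dom(db, f.body, xs)
    if isinstance(f, Exists):                                 # clause 6
        return set() if f.var in xs else dom(db, f.body, xs)
    raise TypeError(f"unsupported formula: {f!r}")

# Invented rows: (program, station, viewers in thousands, placeholder).
P, SN, V, S = Var("P"), Var("SN"), Var("V"), Var("S")
db = {
    "WeekdayTV": {("Gilm.", "Glo.", 12, 1), ("Gilm.", "CBS", 5, 1), ("Hock. N.", "CBC", 15, 1)},
    "WeekendTV": {("Gilm.", "CBS", 4, 2), ("Hock. N.", "CBC", 20, 2),
                  ("Simps.", "CBS", 8, 2), ("Daily Sh.", "CBC", 3, 2)},
}

def over_10(rel):
    return Exists(S, Exists(V, And((Atom(rel, (P, SN, V, S)), Cmp(V, ">", 10)))))

both = And((over_10("WeekdayTV"), over_10("WeekendTV")))
print(sorted(dom(db, both, (P, SN))))   # five program-station pairs
```

Note that the comparison `V > 10` contributes nothing to the reference domain (clause 2(b)), so the domain of the conjunction is simply the union of the program-station pairs that occur in the two tables.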
Examples. Consider the query "find all program-station pairs that achieve a viewership of over 10,000
on both weekdays and weekends". In the domain relational calculus, this query may be formulated as
$[\exists S.\exists V.\, \mathit{WeekdayTV}(P, SN, V, S) \wedge V > 10] \wedge [\exists S.\exists V.\, \mathit{WeekendTV}(P, SN, V, S) \wedge V > 10]$. Table 12 shows
the calculation of the reference domain for this formula on the database instance of Tables [2H11.
Now the frequency of an ER query is defined as follows. 

Query Formula $F$                                                              | Frequency $\mathrm{fr}_{\mathcal{D}}(F)$
-------------------------------------------------------------------------------+------------------
$F_1 = \exists S.\exists V.\, \mathit{WeekdayTV}(P, SN, V, S) \wedge V > 10$     | 2/3
$F_2 = \exists S.\exists V.\, \mathit{WeekendTV}(P, SN, V, S) \wedge V > 10$     | 1/4
$F_1 \wedge F_2$                                                               | 1/5
$F_1 \vee F_2$                                                                 | 2/5
$F_1 \wedge \neg F_2$                                                          | 1/5

Table 13: Frequencies for various formulas in the TV survey database instance $\mathcal{D}$ from Tables [2H11. The free
query variables are $P$ and $SN$, corresponding to program-station pairs.



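Definition 8 Let $F$ be an ER query whose free variables are $X_1, \ldots, X_m$, where $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) \neq \emptyset$.
Then

\[ \mathrm{fr}_{\mathcal{D}}(F) = \frac{|\mathrm{tuples}_{\mathcal{D}}(F)|}{|\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\})|}. \]

Table 13 illustrates the frequencies of various queries.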
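As a small worked instance of Definition 8, the sketch below computes frequencies from explicit sets of result tuples and reference-domain tuples. The function name `frequency` is hypothetical, and the assignment of particular program-station pairs to each set is an assumption, chosen only so that the resulting ratios match the values 2/3, 1/4, 1/5 and 2/5 reported in Table 13.

```python
from fractions import Fraction

def frequency(tuples, ref_domain):
    """fr_D(F) = |tuples_D(F)| / |dom_D(F, {X1..Xm})| for a non-empty reference domain."""
    if not ref_domain:
        raise ValueError("frequency is undefined for an empty reference domain")
    if not tuples <= ref_domain:
        raise ValueError("result tuples must lie inside the reference domain")
    return Fraction(len(tuples), len(ref_domain))

# Assumed sets of (program, station) pairs for F1 (weekday) and F2 (weekend).
dom_f1 = {("Gilm.", "Glo."), ("Gilm.", "CBS"), ("Hock. N.", "CBC")}
tuples_f1 = {("Gilm.", "Glo."), ("Hock. N.", "CBC")}
dom_f2 = {("Gilm.", "CBS"), ("Hock. N.", "CBC"), ("Simps.", "CBS"), ("Daily Sh.", "CBC")}
tuples_f2 = {("Hock. N.", "CBC")}

# By Definition 7, conjunctions and disjunctions of F1 and F2 share the union domain.
dom_both = dom_f1 | dom_f2

print(frequency(tuples_f1, dom_f1))                  # 2/3
print(frequency(tuples_f2, dom_f2))                  # 1/4
print(frequency(tuples_f1 & tuples_f2, dom_both))    # 1/5  (F1 and F2)
print(frequency(tuples_f1 | tuples_f2, dom_both))    # 2/5  (F1 or F2)
print(frequency(tuples_f1 - tuples_f2, dom_both))    # 1/5  (F1 and not F2)
```

Using exact fractions rather than floats keeps the ratios directly comparable with the table entries.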



4 Entity-Relationship Rules 

We finally obtain the notion of an ER association rule, or ER rule for short. 



4.1 Definition of Confidence and Support for ER rules 

Given the concepts we have developed so far, the definitions of confidence and support for an
entity-relationship rule are straightforward.

Definition 9 Let $\mathcal{D}$ be a database instance.

1. An ER association rule is an implication of the form $F \rightarrow G$, where the free variables of $G$ are the
same as or contained in the free variables of $F$, and $F \wedge G$ is a valid ER query.

2. The confidence of an ER association rule $F \rightarrow G$ is given by

\[ \mathrm{conf}_{\mathcal{D}}(F \rightarrow G) = \frac{|\mathrm{tuples}_{\mathcal{D}}(F \wedge G)|}{|\mathrm{tuples}_{\mathcal{D}}(F)|}. \]

3. The support of an ER association rule $F \rightarrow G$ is given by

\[ \mathrm{support}_{\mathcal{D}}(F \rightarrow G) = \mathrm{fr}_{\mathcal{D}}(F \wedge G). \]

As usual with association rules, the implication $F \rightarrow G$ does not indicate logical implication (whenever
$F$ is true, so is $G$) but instead denotes a probabilistic relationship.

Example. Let $\mathcal{D}$ be the TV survey database instance from Tables OUl. Let $F_1$ be the formula

\[ \exists S.\exists SN.\exists V.\, \mathit{WeekdayTV}(P, SN, V, S) \wedge V > 10 \]

and let $F_2$ be the formula

\[ \exists S.\exists SN.\exists V.\, \mathit{WeekendTV}(P, SN, V, S) \wedge V > 10. \]

Consider the rule $F_1 \rightarrow F_2$. The support of this rule is $\mathrm{fr}_{\mathcal{D}}(F_1 \wedge F_2) = 1/4$ (see Table [TT|). The confidence is

\[ \frac{|\{\text{"Gilmore"}, \text{"Hockey Night"}\} \cap \{\text{"Hockey Night"}, \text{"Simpsons"}\}|}{|\{\text{"Gilmore"}, \text{"Hockey Night"}\}|} = \frac{1}{2}. \]

Definition 9 completes our goal of providing a definition of confidence and support for general
entity-relationship queries.
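The example can be replayed with ordinary set operations. In this sketch the function names `support` and `confidence` are hypothetical, both formulas are assumed to have the same single free variable $P$ (so that the result of $F_1 \wedge F_2$ is the intersection of the two result sets), and the four-element reference domain is an assumption consistent with the stated support of 1/4.

```python
from fractions import Fraction

def support(tuples_f, tuples_g, ref_domain):
    """support(F -> G) = fr(F and G), assuming F and G share the same free variables."""
    return Fraction(len(tuples_f & tuples_g), len(ref_domain))

def confidence(tuples_f, tuples_g):
    """conf(F -> G) = |tuples(F and G)| / |tuples(F)|."""
    if not tuples_f:
        raise ValueError("confidence is undefined when F has no result tuples")
    return Fraction(len(tuples_f & tuples_g), len(tuples_f))

tuples_f1 = {"Gilmore", "Hockey Night"}      # programs over 10,000 viewers on weekdays
tuples_f2 = {"Hockey Night", "Simpsons"}     # programs over 10,000 viewers on weekends
# Assumed reference domain of F1 and F2: four programs in total.
programs = {"Gilmore", "Hockey Night", "Simpsons", "Daily Show"}

print(support(tuples_f1, tuples_f2, programs))   # 1/4
print(confidence(tuples_f1, tuples_f2))          # 1/2
```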
4.2 Comparison With Other Rule Languages

This section gives a brief comparison of our rule language and frequency definition to related rule languages.
It is easy to see that the classic association rule approach based on frequent itemsets is a special case. For
example, suppose we have two entity tables: Transactions(number), which stores transactions, and Item(name)
for items, and a relational table TransItems(TransNumber, ItemName) that indicates which items appear in
which transactions. Then for a given item, say "cola", the query $\mathit{Transactions}(X) \wedge \mathit{TransItems}(X, \text{"cola"})$
returns the set of transactions involving "cola", and the frequency of this query is the frequency of these
transactions among all transactions.
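A minimal sketch of this special case, with invented transaction data and hypothetical function names: the reference domain of the conjunction is the union of the transaction numbers in Transactions and those paired with the item in TransItems, which collapses to the set of all transactions when every TransItems row refers to an existing transaction.

```python
from fractions import Fraction

# Invented data; every transaction number in trans_items also occurs in transactions.
transactions = {1, 2, 3, 4}
trans_items = {(1, "cola"), (1, "chips"), (2, "cola"), (3, "bread"), (4, "chips")}

def item_query_frequency(item):
    """Frequency of Transactions(X) AND TransItems(X, item) with free variable X."""
    with_item = {t for (t, name) in trans_items if name == item}
    result = transactions & with_item        # result tuples of the conjunction
    ref_domain = transactions | with_item    # union of the conjuncts' reference domains
    return Fraction(len(result), len(ref_domain))

def classic_support(item):
    """Classical itemset support: fraction of transactions that contain the item."""
    hits = sum(1 for t in transactions if (t, item) in trans_items)
    return Fraction(hits, len(transactions))

print(item_query_frequency("cola"), classic_support("cola"))   # 1/2 1/2
```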

Antonie and Zaiane [T] extend itemset rules with negations, and survey a number of search algorithms for 
finding frequent itemsets with negative conditions. Their search procedure is based on correlation analysis. 

The Warmr system [4] considers queries that are conjunctions of literals (e.g., $P(X, Y)$). The user
specifies a target table $T$; the free query variables in a Warmr query are then bound to the key fields of $T$.
If WeekdayTV is our target table, we would have two free query variables: $P$ for program and $SN$ for station.
All other variables are implicitly existentially quantified. For example, if Customer is the target table, the
Warmr formula $\mathit{Customer}(A) \wedge \mathit{Child}(A, C) \wedge \mathit{Buys}(C, \text{"cola"})$ translates into the domain relational calculus
as $\exists C.\, \mathit{Customer}(A) \wedge \mathit{Child}(A, C) \wedge \mathit{Buys}(C, \text{"cola"})$. If we assume that one of the conjuncts in a Warmr
clause corresponds to the target table (e.g., $\mathit{Customer}(A)$), and all other appearances of the query variables
are related to the target table by foreign key constraints (e.g., the first field in the Child table is a foreign
key to the Customer table), then the reference domain as we have defined it is exactly the target table, and
the frequency that Warmr assigns to a conjunction agrees with our definition. In this sense our definition
of support for ER rules generalizes that for Warmr rules.
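The Customer example can be checked on a toy instance. The rows below are invented, and the variable names are hypothetical; the sketch assumes, as in the text, that the first field of Child is a foreign key to Customer.

```python
from fractions import Fraction

customer = {"ann", "bob", "cy"}                      # target table Customer(A)
child = {("ann", "dan"), ("bob", "eve")}             # Child(A, C); A references customer
buys = {("dan", "cola"), ("eve", "milk"), ("cy", "cola")}

# Reference domains of the three conjuncts with respect to the free variable A.
dom_customer = set(customer)                         # projection of Customer(A) onto A
dom_child = {a for (a, _) in child}                  # projection of Child(A, C) onto A
dom_buys = set()                                     # A does not occur in Buys(C, "cola")
ref_domain = dom_customer | dom_child | dom_buys     # union over the conjunction

# Result tuples of  EXISTS C. Customer(A) AND Child(A, C) AND Buys(C, "cola").
answers = {a for a in customer
           if any(p == a and (c, "cola") in buys for (p, c) in child)}

print(ref_domain == customer)                        # True: the domain is the target table
print(Fraction(len(answers), len(ref_domain)))       # 1/3
```

Under the foreign key assumption the union adds nothing beyond the target table, so the ratio is the fraction of customers that satisfy the query.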

5 The Probability Axioms and A Priori Property 

In order to ensure that Definition [7] yields well-defined probabilities, we verify three facts: (1) the frequency
as defined never involves division by 0, so the frequency is well-defined; (2) the definition entails that
frequencies are between 0 and 1 (inclusive); (3) the frequency of the disjunction of two mutually exclusive
queries is the sum of their respective frequencies. This third property holds only with certain qualifications due to the
restrictions on safe queries. The usual probability axioms include the requirement that (4) the probability
of the whole space, or the "certain event", is 1. We discuss the extent to which this property holds for
our definition of frequency. Finally, we show the Apriori property: frequencies of conjunctions decrease
monotonically, which is important for lattice search methods.

For the first fact, we have the following result. The notion of a valid ER query was specified in Definition 

E 

Proposition 10 Let $F$ be a valid ER query whose free variables are $X_1, \ldots, X_m$. Let $\mathcal{D}$ be any database
instance (without empty tables). Then $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) \neq \emptyset$.

Proof. If $F$ is valid, then for every maximal conjunction $L$ of literals that occurs in $F$, we have $\mathrm{dom}_{\mathcal{D}}(L, \{X_1, \ldots, X_m\}) \neq \emptyset$.
Since the reference domains of more complex formulas are the union of the domains of their subformulas,
it follows that $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) \neq \emptyset$. ■

The next proposition guarantees that the ratios assigned by Definition [7] are properly bounded between
0 and 1.

Proposition 11 Let $F$ be an ER query in which the variables $X_1, \ldots, X_m$ are free such that $F$ is valid
for these variables. Let $\mathcal{D}$ be a database instance. Then $\pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(F) \subseteq \mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\})$,
where $\pi$ is the projection operation of relational algebra.

In the case in which $X_1, \ldots, X_m$ are exactly the free variables of $F$, we have $\pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(F) = \mathrm{tuples}_{\mathcal{D}}(F)$,
so the proposition implies that the ratio $\frac{|\mathrm{tuples}_{\mathcal{D}}(F)|}{|\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\})|}$ is between 0 and 1.
Proof. The proof is by induction on the structure of the ER formula $F$. We begin by noting two basic facts
about valid formulas, which follow easily from Definitions 2, 6 and 7.

1. If $C = C_1 \wedge \cdots \wedge C_k$ is a maximal conjunction in $F$, then $C$ contains a conjunction $X_1 = c_1 \wedge \cdots \wedge X_m = c_m$
or a conjunct $C_i$ that is a valid ER query.

2. If $F_1 \vee F_2$ is a disjunction in $F$, then both of the disjuncts are valid ER queries.

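• If $F$ is an atomic formula of the form $P(t_1, \ldots, t_k)$, then since $F$ is valid for $X_1, \ldots, X_m$, we have
$\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) = \pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(F)$.

• Let $F$ be a single atomic comparison of the form $Y \theta t$, where $t$ is either a variable or a constant. Since
$F$ is valid, it must be of the form $X_1 = c$ where $m = 1$ (i.e., we just have a single free variable $X_1$ and
the atomic formula requires $X_1$ to be equal to a constant $c$). So $\mathrm{dom}_{\mathcal{D}}(F, \{X_1\}) = \{c\}$, and clearly
$\pi_{X_1} \mathrm{tuples}_{\mathcal{D}}(F) \subseteq \{c\}$.

• Let $F$ be a maximal conjunction of $k > 1$ formulas, such that $F = C_1 \wedge \cdots \wedge C_k$.

1. If $F$ is a conjunction of the form $C \wedge X_1 = c_1 \wedge \cdots \wedge X_m = c_m$, then $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) =
\mathrm{dom}_{\mathcal{D}}(C, \{X_1, \ldots, X_m\}) \cup \{(c_1, \ldots, c_m)\}$. Clearly $\pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(F) \subseteq \{(c_1, \ldots, c_m)\}$, which is
a subset of $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\})$.

2. Otherwise $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) = \bigcup_{i=1}^{k} \mathrm{dom}_{\mathcal{D}}(C_i, \{X_1, \ldots, X_m\})$. Since $F$ is valid, by
Observation 1 at least one of the conjuncts $C_i$ is valid. So by inductive hypothesis,

\[ \pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(C_i) \subseteq \mathrm{dom}_{\mathcal{D}}(C_i, \{X_1, \ldots, X_m\}). \]

Now since $F$ is a conjunction involving $C_i$, it follows that

\[ \pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(F) \subseteq \pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(C_i) \]

and that

\[ \mathrm{dom}_{\mathcal{D}}(C_i, \{X_1, \ldots, X_m\}) \subseteq \mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}), \]

which establishes the inductive hypothesis for this case.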

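• If $F$ is $F_1 \vee F_2$, then by Clause 5 of the definition of a safe query, both $F_1$ and $F_2$ are valid and contain
all the variables $\{X_1, \ldots, X_m\}$ as free variables. So

\[ \pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(F) = \pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(F_1) \cup \pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(F_2). \]

Also, by inductive hypothesis,

\[ \pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(F_1) \subseteq \mathrm{dom}_{\mathcal{D}}(F_1, \{X_1, \ldots, X_m\}) \]

and

\[ \pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(F_2) \subseteq \mathrm{dom}_{\mathcal{D}}(F_2, \{X_1, \ldots, X_m\}), \]

and by definition

\[ \mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) = \mathrm{dom}_{\mathcal{D}}(F_1, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_{\mathcal{D}}(F_2, \{X_1, \ldots, X_m\}). \]

So

\[ \pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(F) \subseteq \mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) \]

as required.

• If $F$ is $\neg G$ for some formula $G$, then $F$ is not a safe query, hence not an ER query, and the claim holds
vacuously.

• If $F$ is $\exists Y.G$, then $Y \notin \{X_1, \ldots, X_m\}$, since the variables $X_1, \ldots, X_m$ are free in $F$. So

\[ \mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}) = \mathrm{dom}_{\mathcal{D}}(G, \{X_1, \ldots, X_m\}), \]

and

\[ \pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(F) = \pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(G) \]

by the semantics of the existential quantifier. Clearly if $F$ is valid, then so is $G$, so by inductive
hypothesis

\[ \pi_{(X_1, \ldots, X_m)} \mathrm{tuples}_{\mathcal{D}}(G) \subseteq \mathrm{dom}_{\mathcal{D}}(G, \{X_1, \ldots, X_m\}), \]

which completes the inductive proof.

■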

The third fundamental property of probabilities is finite additivity: the frequency of the disjunction of two mutually
exclusive events is the sum of the individual frequencies. The difficulty with this property is not that
it fails for our frequency definition, but that it is not straightforwardly expressed in our language of safe
queries. For example, a natural formulation of finite additivity would be to require that $\mathrm{fr}_{\mathcal{D}}(F) + \mathrm{fr}_{\mathcal{D}}(\neg F) =
\mathrm{fr}_{\mathcal{D}}(F \vee \neg F)$. But if $F$ is a safe query, then $\neg F$ is not safe, so the frequency $\mathrm{fr}_{\mathcal{D}}(\neg F)$ is not defined. Another
way to see the difficulty is to note that in standard probability theory (with a Boolean algebra of events),
finite additivity is equivalent to the requirement that $\Pr(\bar{A}) = 1 - \Pr(A)$, where $\bar{A}$ is the complement of
event $A$. But this cannot be expressed as a requirement on safe queries since the negation of a safe query is
not itself safe.

However, we can show a qualified version of finite additivity. If $S$ and $F$ are valid safe queries with the
same free variables, then the formulas $S \wedge F$ and $S \wedge \neg F$ are also valid safe queries. For these formulas we
can show the following result.

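Proposition 12 Let $S$ and $F$ be valid safe queries with the same free variables $\{X_1, \ldots, X_m\}$. Then for any
database instance $\mathcal{D}$ we have

\[ \mathrm{fr}_{\mathcal{D}}([S \wedge F] \vee [S \wedge \neg F]) = \mathrm{fr}_{\mathcal{D}}(S \wedge F) + \mathrm{fr}_{\mathcal{D}}(S \wedge \neg F) = \frac{|\mathrm{tuples}_{\mathcal{D}}(S)|}{|\mathrm{dom}_{\mathcal{D}}(S, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\})|}. \]

Proof. This follows from the definitions. We have

\[ \mathrm{dom}_{\mathcal{D}}([S \wedge F] \vee [S \wedge \neg F], \{X_1, \ldots, X_m\}) = \mathrm{dom}_{\mathcal{D}}([S \wedge F], \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_{\mathcal{D}}([S \wedge \neg F], \{X_1, \ldots, X_m\}), \]

and since $\mathrm{dom}_{\mathcal{D}}([S \wedge F], \{X_1, \ldots, X_m\}) = \mathrm{dom}_{\mathcal{D}}([S \wedge \neg F], \{X_1, \ldots, X_m\}) = \mathrm{dom}_{\mathcal{D}}(S, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\})$, it follows that

\[ \mathrm{dom}_{\mathcal{D}}([S \wedge F] \vee [S \wedge \neg F], \{X_1, \ldots, X_m\}) = \mathrm{dom}_{\mathcal{D}}(S, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\}). \]

Clearly $\mathrm{tuples}_{\mathcal{D}}([S \wedge F] \vee [S \wedge \neg F]) = \mathrm{tuples}_{\mathcal{D}}(S)$, so

\[ \mathrm{fr}_{\mathcal{D}}([S \wedge F] \vee [S \wedge \neg F]) = \frac{|\mathrm{tuples}_{\mathcal{D}}(S)|}{|\mathrm{dom}_{\mathcal{D}}(S, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\})|}. \]

Also,

\[ \mathrm{fr}_{\mathcal{D}}(S \wedge F) = \frac{|\mathrm{tuples}_{\mathcal{D}}(S \wedge F)|}{|\mathrm{dom}_{\mathcal{D}}(S, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\})|} \quad \text{and} \quad \mathrm{fr}_{\mathcal{D}}(S \wedge \neg F) = \frac{|\mathrm{tuples}_{\mathcal{D}}(S \wedge \neg F)|}{|\mathrm{dom}_{\mathcal{D}}(S, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\})|}. \]

Since $\mathrm{tuples}_{\mathcal{D}}(S \wedge F)$ and $\mathrm{tuples}_{\mathcal{D}}(S \wedge \neg F)$ are disjoint and their union is $\mathrm{tuples}_{\mathcal{D}}(S)$, we obtain

\[ \mathrm{fr}_{\mathcal{D}}(S \wedge F) + \mathrm{fr}_{\mathcal{D}}(S \wedge \neg F) = \frac{|\mathrm{tuples}_{\mathcal{D}}(S)|}{|\mathrm{dom}_{\mathcal{D}}(S, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\})|}, \]

which was to be shown. ■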
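The qualified additivity property can be checked numerically on small sets. The sketch below is illustrative only: the function name `additivity_terms` is hypothetical, entities are represented by integers, and the four input sets are invented, subject to the constraint that each query's result tuples lie inside its reference domain.

```python
from fractions import Fraction

def additivity_terms(tuples_s, tuples_f, dom_s, dom_f):
    """Return fr([S & F] | [S & ~F]), fr(S & F), fr(S & ~F) over the shared union domain."""
    ref = dom_s | dom_f                       # reference domain of all three formulas
    s_and_f = tuples_s & tuples_f             # result tuples of S AND F
    s_and_not_f = tuples_s - tuples_f         # result tuples of S AND NOT F
    fr_or = Fraction(len(s_and_f | s_and_not_f), len(ref))
    return fr_or, Fraction(len(s_and_f), len(ref)), Fraction(len(s_and_not_f), len(ref))

dom_s, tuples_s = {1, 2, 3, 4, 5}, {1, 2, 3, 4}
dom_f, tuples_f = {3, 4, 5, 6, 7}, {3, 4, 6}

fr_or, fr_and, fr_and_not = additivity_terms(tuples_s, tuples_f, dom_s, dom_f)
print(fr_or, fr_and, fr_and_not)              # 4/7 2/7 2/7
print(fr_or == fr_and + fr_and_not)           # True
print(Fraction(len(tuples_s), len(dom_s)))    # 4/5, the frequency of S on its own domain
```

In this toy instance the disjunction has frequency 4/7 while $S$ alone has frequency 4/5, because the disjunction is evaluated against the larger union domain.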

This result illustrates that two logically equivalent queries can have different frequencies in a given
database instance, although their result tuples are always the same. In particular, although the queries
$[S \wedge F] \vee [S \wedge \neg F]$ and $S$ are logically equivalent, they have different reference domains: the domain of
$[S \wedge F] \vee [S \wedge \neg F]$ also includes the domain of the query $F$. This is due to our closed-world assumption: since
the entities in the query $F$ are among those mentioned in the query $[S \wedge F] \vee [S \wedge \neg F]$, they are included
among the potential answers to the query, although in fact no entity satisfying $F$ will be an actual answer
to the query unless it is also an entity satisfying $S$.

The final standard property of probability measures on a Boolean algebra is that $P(X) = 1$, where $X$ is the
"certain event" that contains all possible outcomes. One difficulty with this property from the point of view of
our frequency definition is again not so much that the property fails to hold but that it is not straightforward
to express. A natural way to translate the axiom into a logical framework is to require that all tautologies
or logically necessary queries receive probability 1. For example, the query $\mathit{Student}(A) \vee \neg \mathit{Student}(A)$ is a
tautology when viewed as a logical formula, but it is not a safe query. Another conceptually illuminating
difficulty is that in our frequency definition, there is no single fixed space of possible outcomes or events
that is independent of the query being asked. Rather, we define a space of possible outcomes dynamically
for every query (i.e., $\mathrm{dom}_{\mathcal{D}}(F, \{X_1, \ldots, X_m\})$ for query $F$). For a given reference domain, the probability 1
property holds to the extent that we can express it. For example, if the only two possible genders are male
and female, then the query $[\mathit{Student}(A) \wedge \mathit{Gender}(A, \mathit{male})] \vee [\mathit{Student}(A) \wedge \mathit{Gender}(A, \mathit{female})]$ receives
frequency 1 in every database instance.

Finally we show that frequency as defined decreases monotonically with respect to conjunctions. This
is important because many algorithms that search for frequent query formulas use this property to avoid
exhaustive search. The following result guarantees that the frequency of a conjunction is no greater than the
frequency of its conjuncts, which we refer to as the Apriori property.

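Proposition 13 (The Apriori Property) Let $\mathcal{D}$ be a database instance with valid ER query $F_1$ whose
free variables are $X_1, \ldots, X_m$, and suppose that $F_1 \wedge F_2$ is also a valid ER query whose free variables are
$X_1, \ldots, X_m$. Then $\mathrm{fr}_{\mathcal{D}}(F_1 \wedge F_2) \leq \mathrm{fr}_{\mathcal{D}}(F_1)$.

Proof. Clearly

\[ \mathrm{tuples}_{\mathcal{D}}(F_1 \wedge F_2) \subseteq \mathrm{tuples}_{\mathcal{D}}(F_1), \]

and

\[ \mathrm{dom}_{\mathcal{D}}(F_1, \{X_1, \ldots, X_m\}) \subseteq \mathrm{dom}_{\mathcal{D}}(F_1 \wedge F_2, \{X_1, \ldots, X_m\}). \]

So

\[ \frac{|\mathrm{tuples}_{\mathcal{D}}(F_1 \wedge F_2)|}{|\mathrm{dom}_{\mathcal{D}}(F_1 \wedge F_2, \{X_1, \ldots, X_m\})|} \leq \frac{|\mathrm{tuples}_{\mathcal{D}}(F_1)|}{|\mathrm{dom}_{\mathcal{D}}(F_1, \{X_1, \ldots, X_m\})|}. \]

■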

Discussion. Previous approaches to mining multi-relational rules such as Warmr mine rules for just one
target table. Our approach in contrast can potentially search the entire space of queries for a given language
bias, since by the proposition just established, the Apriori property holds for the entire query space, not
just for a fixed target table or key atom, given our definition of frequency and support. So compared to an
iterative approach where we repeatedly apply a single-table rule miner to different tables in the database,
our approach offers computational advantages. Intuitively, our approach combines the results of rule mining
for separate tables when it considers rules that involve the separate tables at the same time. For example,
suppose that for the Student table, we find that the query $\mathit{Student}(X) \wedge \mathit{Age}(X, 30)$ is infrequent. Then from
Proposition 13 we can conclude that the query $\mathit{Student}(X) \wedge \mathit{Age}(X, 30) \wedge \mathit{Professor}(X)$ is infrequent as
well. A traditional single-table rule mining system applied to both target tables would have to evaluate this
conjunction twice, once with the target table Student and the second time with the target table Professor.
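The pruning argument can be sketched as a level-wise search. This is not the paper's algorithm (none is given); it is a minimal illustration that assumes every atom has the single free variable $X$, so that each atom's result tuples and reference domain coincide and the frequency of a conjunction is the size of the intersection divided by the size of the union. The entity sets, the name `Age30` (standing for $\mathit{Age}(X, 30)$), and the threshold are invented.

```python
from fractions import Fraction
from itertools import combinations

def frequency(atoms, tables):
    """fr of a conjunction of single-variable atoms: |intersection| / |union|."""
    sets = [tables[a] for a in atoms]
    return Fraction(len(set.intersection(*sets)), len(set.union(*sets)))

def frequent_conjunctions(tables, min_freq):
    """Level-wise search that only extends conjunctions whose sub-conjunctions are frequent."""
    frequent = []
    level = [frozenset([name]) for name in sorted(tables)]
    while level:
        kept = [c for c in level if frequency(c, tables) >= min_freq]
        frequent.extend(kept)
        kept_set = set(kept)
        next_level = set()
        for a, b in combinations(kept, 2):
            cand = a | b
            # Apriori pruning: every sub-conjunction one atom shorter must be frequent.
            if len(cand) == len(a) + 1 and all(cand - {x} in kept_set for x in cand):
                next_level.add(cand)
        level = sorted(next_level, key=sorted)
    return frequent

tables = {
    "Student": {1, 2, 3, 4, 5, 6},
    "Age30": {5, 9},
    "Professor": {5, 6, 7, 8},
}
result = frequent_conjunctions(tables, Fraction(1, 5))
print([sorted(c) for c in result])
```

With these sets, Student and Age30 together have frequency 1/7, below the threshold, so the three-atom conjunction is never evaluated.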

The price for the computational advantage of the Apriori property holding throughout the query space is
that our approach restricts the set of interesting queries compared to an iterative application of single-table
rule mining. For example, it may be the case that the rule $\mathit{Professor}(X) \wedge \mathit{Student}(X) \rightarrow \mathit{Age}(X, 30)$
receives enough support if evaluated with respect to Professors (because it may be the case that most
professors who are also taking courses as students are younger), but does not receive enough support if
evaluated with respect to Students (perhaps because very few students are also professors to begin with).
Our definition of support based on taking the union of the database tables can be seen as a cautious approach
because if a query is frequent with respect to the union of two tables, it is frequent with respect to either
table. So a query that is frequent with respect to the union of the Professor and Student tables is frequent
with respect to both.

6 Conclusion 

The goal of this report was to extend the concepts of confidence and support to a new class of association rules
which we call entity-relationship rules. Entity-relationship rules are based on the domain relational calculus;
they are much more flexible and expressive than standard itemset rules. ER rules allow for negation, nested
Boolean combinations, and quantification. The main conceptual contribution of this report is a definition of
frequency for entity-relationship queries. Instead of beginning with a specified target table or "key atom",
we dynamically define a reference or base domain of individuals for each ER query. The key idea of our
definition is to take the base set of entities of a conjunctive query to be the union of the conjuncts' base
sets. For example, the frequency of the query $\mathit{Professor}(X) \wedge \mathit{Customer}(X)$ is computed with respect to
the union of Professors and Customers. We proved that our frequency definition satisfies standard axioms
for probabilities and validates the Apriori property: the frequency of a conjunction is no greater than the
frequency of any conjunct.

As usual in data mining, there is a tradeoff between the expressiveness of the rule or pattern language
and the difficulty of searching for significant patterns. Our rule language is very general, and in practice a
computational search for interesting entity-relationship rules will require a language restriction (bias). A
central topic for future research is to explore language restrictions that make feasible a computational search
for interesting entity-relationship rules.